This is a document to showcase the analysis done for the project Expedite. It was undertaken by the Computational Biology Facility (https://www.liverpool.ac.uk/computational-biology-facility/) at the University of Liverpool. The data has been provided by Dr Christopher Hill, a post-doctoral researcher based at the Liverpool Women’s Hospital and consists of metadata (anonymised patient information), levels of progesterone and HCG and normalised batch corrected NMR metabolomics data (the later consisting of peak abundances - some of those peaks are associated to a known metabolite but multiple are still unidentified).
As a first step we load required functions.
source("ExpediteFunctions.R")
We load the data and do some prelim visualizations and checks.
exp_Combat <- read_excel("EXPEDITE_Data_Aug23.xlsx", sheet = 2)
# note that the data above contain metadata and data. Metadata is in the first
# 9 columns note that after exploration both HCG and Progesterone do not seem
# to be log2 so they will be transformed
exp_Combat$HCG <- log2(exp_Combat$HCG)
exp_Combat$Progesterone <- log2(exp_Combat$Progesterone)
# visualisations with PCA to appraise possible structure. The batch correction
# proved effective but no links to other structure can be made
do_PCA_Plot_Nov16(exp_Combat[, c(4, 5, 10:ncol(exp_Combat))], exp_Combat$Group, scale = F,
legendName = "", colores = c("olivedrab", "orange", "steelblue", "darkgrey"))
do_PCA_Plot_Nov16(exp_Combat[, c(4, 5, 10:ncol(exp_Combat))], exp_Combat$Batch, scale = F,
legendName = "")
# assessing missingness
apply(exp_Combat, 2, function(x) sum(is.na(x)))[apply(exp_Combat, 2, function(x) sum(is.na(x))) !=
0]
## HCG Progesterone
## 9 8
## Gestation at consent (weeks) BMI
## 2 3
We have identified that there are four variables with missing values. HCG, Progesterone, Gestation at consent and BMI.For this first analysis we will not include patients with missing value (losing 14 total). For the time being we excluded these patients from the study although in the future we could try imputation.
We will build univariate models to determine which variables are important to explain differences between groups. To select the best models to build and which covariates to include, we will do an assessment of variance.
We aim to try to find markers between all the different diagnosis but this may not be possible so we will also test whether we can find markers between normal and other diagnosis. For this we would need a dummy group
bimodalGroup <- exp_Combat$Group
bimodalGroup[!bimodalGroup %in% c("LNSP")] <- "Other"
bimodalGroup <- factor(bimodalGroup, ordered = T, levels = c("LNSP", "Other"))
As our aim is to try to find a composite biomarker using a machine learning approach, to follow best practices and avoid data leakage we will be undertaken the Univariate Statistics on the data split that will serve for variable selection leaving an untouched part of data for validation
# create dataset with the key variables for modelling, HCG and progesterone
# plus the metabolomics we take complete observations for this as this is a
# first pass for the analysis. Inputation may be consider at a later point.
# Analysis using the data that has gone through Combat
set.seed(800)
data_for_limma <- data.frame(Group = exp_Combat$Group, BiGroup = bimodalGroup, BMI = exp_Combat$BMI,
Smoker = exp_Combat$`Smoker (Y/N)`, Gestation = exp_Combat$`Gestation at consent (weeks)`,
Age = exp_Combat$`Age (years)`, exp_Combat[, c(4, 5, 10:ncol(exp_Combat))])
data_for_limma <- data_for_limma[complete.cases(data_for_limma), ]
data_for_limma$Group <- factor(data_for_limma$Group, levels = c("LNSP", "Miscarriage",
"tEP", "PUL"))
numeric_data <- data_for_limma[, 7:ncol(data_for_limma)]
form <- ~(1 | Group) + BMI + (1 | Smoker) + Gestation + Age
# note as per package recommendation, all categorical variables are modeled as
# random effects with continuous as fixed effects
varPart <- fitExtractVarPartModel(t(numeric_data), form, data_for_limma[, 1:6])
vp <- sortCols(varPart, last = "Residuals")
vp_bars <- plotPercentBars(vp[1:30, ]) + theme(legend.position = "bottom") #top 30 so we can see some detail
vp_violin <- plotVarPart(vp)
ggarrange(vp_bars, vp_violin, labels = c("A", "B"), ncol = 2, nrow = 1)
While unfortunately the majority of the variance is not explained by one of the factors captured we can see that Age, BMI and Group are the most important factors, followed by Gestation. The smoking status does not seem to have a relevant weight and as such we will not include it in the models.
This step will be undertaken using data from metabolites + HCG +Progesterone
# creating the firs split to keep some data for further validation ----
set.seed(800)
# first for multigroup
dataSplit_forValidation <- applyPrepData(data_train = numeric_data, prepData = prepData,
factor_var = data_for_limma$Group, split = 0.8, n = 1)
# indeces
tr_idx_mg <- dataSplit_forValidation[[1]]$idx
# then for 2 group
dataSplit_forValidation_2group <- applyPrepData(data_train = numeric_data, prepData = prepData,
factor_var = data_for_limma$BiGroup, split = 0.8, n = 1)
tr_idx_2g <- dataSplit_forValidation_2group[[1]]$idx
# these lists will be used in the multivariate part but for the univariate part
# we only need the indices in the tr_idx variables
design_MultGroups <- model.matrix(~0 + Group + BMI + Gestation + Age, data = data_for_limma[tr_idx_mg,
])
design_2groups <- model.matrix(~0 + BiGroup + BMI + Gestation + Age, data = data_for_limma[tr_idx_2g,
])
colnames(design_MultGroups) <- c("LNSP", "Miscarriage", "tEP", "PUL", "BMI", "Gestation",
"Age")
colnames(design_2groups) <- c("LNSP", "Other", "BMI", "Gestation", "Age")
fit_MultGroups <- lmFit(t(numeric_data[tr_idx_mg, ]), design_MultGroups)
fit_2groups <- lmFit(t(numeric_data[tr_idx_2g, ]), design_2groups)
# Contrasts for the multigroup comparison ----
contrasts_MultGroups <- makeContrasts(Miscarriage - LNSP, tEP - LNSP, PUL - LNSP,
Miscarriage - tEP, levels = design_MultGroups)
contrast_fit_MultGroups <- contrasts.fit(fit_MultGroups, contrasts_MultGroups)
contrast_ebayes_MultGroups <- eBayes(contrast_fit_MultGroups)
results_Miscarriage_LNSP <- data.frame(topTable(contrast_ebayes_MultGroups, coef = 1,
adjust = "fdr", number = ncol(exp_Combat), p.value = 0.05))
results_tEP_LNSP <- data.frame(topTable(contrast_ebayes_MultGroups, coef = 2, adjust = "fdr",
number = ncol(exp_Combat), p.value = 0.05))
results_PUL_LNSP <- data.frame(topTable(contrast_ebayes_MultGroups, coef = 3, adjust = "fdr",
number = ncol(exp_Combat), p.value = 0.05))
results_Miscarriage_tEP <- data.frame(topTable(contrast_ebayes_MultGroups, coef = 4,
adjust = "fdr", number = ncol(exp_Combat), p.value = 0.05)) #this results in none so we do not include it later
results_Miscarriage_LNSP$Metabolite <- rownames(results_Miscarriage_LNSP)
results_Miscarriage_LNSP$Comparison <- "Miscarriage_LNSP"
results_tEP_LNSP$Metabolite <- rownames(results_tEP_LNSP)
results_tEP_LNSP$Comparison <- "tEP_LNSP"
results_PUL_LNSP$Metabolite <- rownames(results_PUL_LNSP)
results_PUL_LNSP$Comparison <- "PUL_LNSP"
# results_Miscarriage_tEP$Metabolite<-rownames(results_Miscarriage_tEP)
# results_Miscarriage_tEP$Comparison<-'Miscarriage_tEP'
results <- rbind(results_Miscarriage_LNSP, results_tEP_LNSP, results_PUL_LNSP) #,results_Miscarriage_tEP)
results
write.csv(results, "UnivarResults_CombatData_MultiGroupComparison.csv")
unique(results$Metabolite)
## [1] "HCG" "Glutamine_115"
## [3] "Progesterone" "phenylalanine_11"
## [5] "Arginine_122" "Alanine_128"
## [7] "unknown_31" "X2_hydroxybutyrate_146"
## [9] "Unknown_54" "acetate_121"
## [11] "creatinine_38" "Unknown_52"
## [13] "Unknown_27" "proline_107"
## [15] "proline_110" "unknown_32"
## [17] "X2_hydroxyisovalerate_148" "unknown_30"
## [19] "unknown_120" "unknown_29"
## [21] "LDL_150" "unknown_99"
## [23] "tyrosine_20"
s <- length(unique(results$Metabolite))
# contrasts for the 2-group comparison ----
# Contrasts for the multigroup comparison ----
contrasts_2groups <- makeContrasts(Other - LNSP, levels = design_2groups)
contrast_fit_2groups <- contrasts.fit(fit_2groups, contrasts_2groups)
contrast_ebayes_2groups <- eBayes(contrast_fit_2groups)
# if I run the eBayes without the contrasts bit I will get info on the effect
# of the confounders. density plots of the different confounders
results_Other_LNSP <- data.frame(topTable(contrast_ebayes_2groups, coef = 1, adjust = "fdr",
number = ncol(exp_Combat), p.value = 0.05))
nrow(results_Other_LNSP) #number of significant metabolites
## [1] 17
results_Other_LNSP$Metabolite <- rownames(results_Other_LNSP)
results_Other_LNSP$Comparison <- "Other_LNSP"
results_Other_LNSP
write.csv(results_Other_LNSP, "UnivarResults_CombatData_2GroupComparison.csv")
unique(results_Other_LNSP$Metabolite)
## [1] "HCG" "Glutamine_115" "Progesterone" "Alanine_128"
## [5] "phenylalanine_11" "unknown_31" "unknown_88" "unknown_30"
## [9] "unknown_29" "unknown_102" "unknown_130" "Arginine_122"
## [13] "Unknown_54" "glutamate_111" "unknown_120" "unknown_57"
## [17] "acetate_121"
s2 <- length(unique(results_Other_LNSP$Metabolite))
In total we have found 23 potential candidates as biomarkers for the multigroup comparisons and 17 for the 2 group comparison. Below are two PCAs to see if the structure of the data has improved but we cannot really say it has.
# multigroup comparison
do_PCA_Plot_Nov16(exp_Combat[, unique(results$Metabolite)], exp_Combat$Group, scale = F,
legendName = "")
do_PCA_Plot_Nov16(exp_Combat[, unique(results_Other_LNSP$Metabolite)], bimodalGroup,
scale = F, legendName = "")
# two group comparison
Now using those variables we want to see if we can find a suitable composite biomarker. We will bring the significant signals forward to do further selection.
We will apply an 10-fold cross-validation system to select variables. We will do so by splitting the data in different slices and finding consensus variables. The variable selection will be undertaken using Lasso selection. We will undertake this process for multigroup and 2-group comparisons.
univarRes<-list("Comp_Multi"=unique(results$Metabolite),
"Comp_2group"=unique(results_Other_LNSP$Metabolite))
venn.diagram(univarRes,filename = "VennDiagramUnivarResults.png")
## [1] 1
display_venn <- function(x, ...){
library(VennDiagram)
grid.newpage()
venn_object <- venn.diagram(x, filename = NULL, ...)
grid.draw(venn_object)
}
display_venn(univarRes,fill=c("cadetblue2","salmon3"))
In the Venn diagram above we can see the overlap of the significant variables for each dataset and model. Below, the identities for those signals are shown. For this step we will leave HCG and Progesterone out and we will focus on refining tthe list of metabolites to include in the model.
univarResDF <- rbind(data.frame(Metab = unique(results$Metabolite), Comp = "Combat_Multi"),
data.frame(Metab = unique(results_Other_LNSP$Metabolite), Comp = "Combat_2group"))
table(univarResDF$Metab, univarResDF$Comp)
##
## Combat_2group Combat_Multi
## acetate_121 1 1
## Alanine_128 1 1
## Arginine_122 1 1
## creatinine_38 0 1
## glutamate_111 1 0
## Glutamine_115 1 1
## HCG 1 1
## LDL_150 0 1
## phenylalanine_11 1 1
## Progesterone 1 1
## proline_107 0 1
## proline_110 0 1
## tyrosine_20 0 1
## unknown_102 1 0
## unknown_120 1 1
## unknown_130 1 0
## Unknown_27 0 1
## unknown_29 1 1
## unknown_30 1 1
## unknown_31 1 1
## unknown_32 0 1
## Unknown_52 0 1
## Unknown_54 1 1
## unknown_57 1 0
## unknown_88 1 0
## unknown_99 0 1
## X2_hydroxybutyrate_146 0 1
## X2_hydroxyisovalerate_148 0 1
tableUnivarRes <- as.matrix(table(univarResDF$Metab, univarResDF$Comp))
tableUnivarRes_sum <- data.frame(Metab = rownames(tableUnivarRes), Count = rowSums(tableUnivarRes))
tableUnivarRes_sum[order(tableUnivarRes_sum[, 2], decreasing = T), ]
univarResDF_noHCG_Prog <- univarResDF[!univarResDF$Metab %in% c("HCG", "Progesterone"),
]
In the steps above, the 10-fold cross validation is applied to select variables. Note that each fold will undergo a selection on 100 models, and only if selected in at least 80% of the models the variable is considered as a candidate stable marker.
We will undertake this process without HCG and progesterone with the aim of selecting the most predictive metabolites and assess their predictive power overall
# creating data splits for N-Fold cross validation ---- for multigroup
dataSplits <- applyPrepData(data_train = dataSplit_forValidation[[1]]$train[, colnames(dataSplit_forValidation[[1]]$train) %in%
univarResDF_noHCG_Prog[univarResDF_noHCG_Prog$Comp == "Combat_Multi", "Metab"]],
prepData = prepData, factor_var = dataSplit_forValidation[[1]]$factor_train,
split = 0.9, n = 10)
# for 2 groups
dataSplits_2groups <- applyPrepData(data_train = dataSplit_forValidation_2group[[1]]$train[,
colnames(dataSplit_forValidation_2group[[1]]$train) %in% univarResDF_noHCG_Prog[univarResDF_noHCG_Prog$Comp ==
"Combat_2group", "Metab"]], prepData = prepData, factor_var = dataSplit_forValidation_2group[[1]]$factor_train,
split = 0.9, n = 10)
# applying the Variable selection in the 10 data splits created with Lasso----
# for multigroup
exp_LassoRes_multiGroup <- applyMultiVarSelectionForCrossVal_withModel(inputListwSplits = dataSplits,
functionSelection = multimodalLasso, thresMod = 80, ntree = 500)
# for 2groups
exp_LassoRes_2groups <- applyMultiVarSelectionForCrossVal_withModel(inputListwSplits = dataSplits_2groups,
functionSelection = TwoClassLasso, thresMod = 80, ntree = 500)
Note that below we are presenting each metabolite signal and the number of times that it passed the 80% threshold accross the 10 data folds. Then we use the most consistent signals to visualise potential structure using PCA.
# Check the frequency certain variables are selected on the different datasets,
# the aim is to select those that are selected in a high number of data slices
exp_LassoRes_concRes_multiGroup <- as.data.frame(table(concatenateVarSelec(exp_LassoRes_multiGroup)))
exp_LassoRes_concRes_multiGroup
exp_LassoRes_concRes_2groups <- as.data.frame(table(concatenateVarSelec(exp_LassoRes_2groups)))
exp_LassoRes_concRes_2groups
# Visualising structure using all the dataset with variables that appear in at
# least 8 data slices plus some key others such as BMI, gestation and the
# levels of progesterone and HCG. While no clear structure is seen in the PCA,
# we can see that the group with the worse prognosis (Other) presents some
# differences accoss PC1 with respect to LNSP
# PCA using only selected metabolite signals for multiple groups
do_PCA_Plot_Nov16(numeric_data[, as.character(exp_LassoRes_concRes_multiGroup[exp_LassoRes_concRes_multiGroup$Freq >
8, 1])], data_for_limma$Group, legendName = "", colores = c("olivedrab", "orange",
"steelblue", "darkgrey"))
# PCA using selected metabolite signals+BMI+Gestation+AgeProgesterone+HCG for
# multiple groups
do_PCA_Plot_Nov16(data.frame(numeric_data[, as.character(exp_LassoRes_concRes_multiGroup[exp_LassoRes_concRes_multiGroup$Freq >
8, 1])], data_for_limma$BMI, data_for_limma$Gestation, data_for_limma$Age, data_for_limma$Progesterone,
data_for_limma$HCG), data_for_limma$Group, legendName = "", colores = c("olivedrab",
"orange", "steelblue", "darkgrey"))
# PCA using only selected metabolite signals for 2group
BiGroupCAO <- factor(gsub("Other", "CAO", data_for_limma$BiGroup), ordered = T, levels = c("LNSP",
"CAO"))
do_PCA_Plot_Nov16(data = numeric_data[, as.character(exp_LassoRes_concRes_2groups[exp_LassoRes_concRes_2groups$Freq >
6, 1])], groups = BiGroupCAO, legendName = "", colores = c("green4", "darkmagenta"))
# PCA using selected metabolite signals+BMI+Gestation+Age+Progesterone+HCG for
# 2group comparison
do_PCA_Plot_Nov16(data.frame(numeric_data[, as.character(exp_LassoRes_concRes_2groups[exp_LassoRes_concRes_2groups$Freq >
6, 1])], data_for_limma$BMI, data_for_limma$Gestation, data_for_limma$Age, data_for_limma$Progesterone,
data_for_limma$HCG), BiGroupCAO, legendName = "", colores = c("green4", "darkmagenta"))
# PCA using selected metabolite signals+BMI+Gestation+Age+Progesterone+HCG for
# 2group comparison - further metabolite selection (8 slices)
do_PCA_Plot_Nov16(data.frame(numeric_data[, as.character(exp_LassoRes_concRes_2groups[exp_LassoRes_concRes_2groups$Freq >
8, 1])], data_for_limma$BMI, data_for_limma$Gestation, data_for_limma$Age, data_for_limma$Progesterone,
data_for_limma$HCG), BiGroupCAO, legendName = "", colores = c("green4", "darkmagenta"))
# Assessing performance for different selection ----
# for multigroup
exp_LassoRes_multiGroup_modelMetExtract <- modelMetrExtract(exp_LassoRes_multiGroup)
write.csv(exp_LassoRes_multiGroup_modelMetExtract, "ModelMetrics_DifferentVariables_10Slices_Multigroup.csv")
# for 2group
exp_LassoRes_2groups_modelMetExtract <- modelMetrExtract(exp_LassoRes_2groups)
write.csv(exp_LassoRes_2groups_modelMetExtract, "ModelMetrics_DifferentVariables_10Slices_2groups.csv")
print(exp_LassoRes_multiGroup_modelMetExtract[, c(1, 2, 3, 4, 7, 8, 9, 13)], row.names = F)
## vars
## phenylalanine_11 tyrosine_20 Unknown_27 unknown_29 unknown_30 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 X2_hydroxyisovalerate_148
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 creatinine_38 Unknown_54 unknown_99 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 X2_hydroxyisovalerate_148
## phenylalanine_11 Unknown_27 unknown_29 unknown_30 unknown_31 unknown_32 creatinine_38 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_29 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 LDL_150
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 creatinine_38 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 tyrosine_20 Unknown_27 unknown_29 unknown_30 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 X2_hydroxyisovalerate_148
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 creatinine_38 Unknown_54 unknown_99 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 X2_hydroxyisovalerate_148
## phenylalanine_11 Unknown_27 unknown_29 unknown_30 unknown_31 unknown_32 creatinine_38 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_29 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 LDL_150
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 creatinine_38 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 tyrosine_20 Unknown_27 unknown_29 unknown_30 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 X2_hydroxyisovalerate_148
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 creatinine_38 Unknown_54 unknown_99 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 X2_hydroxyisovalerate_148
## phenylalanine_11 Unknown_27 unknown_29 unknown_30 unknown_31 unknown_32 creatinine_38 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_29 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 LDL_150
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 creatinine_38 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 tyrosine_20 Unknown_27 unknown_29 unknown_30 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 X2_hydroxyisovalerate_148
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 creatinine_38 Unknown_54 unknown_99 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 X2_hydroxyisovalerate_148
## phenylalanine_11 Unknown_27 unknown_29 unknown_30 unknown_31 unknown_32 creatinine_38 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_29 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128 LDL_150
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 creatinine_38 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## phenylalanine_11 Unknown_27 unknown_30 unknown_31 Unknown_54 proline_107 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## Class Sensitivity Specificity Precision Recall F1
## LNSP 0.9166667 0.5384615 0.6470588 0.9166667 0.7586207
## Miscarriage 0.3333333 0.8947368 0.5000000 0.3333333 0.4000000
## tEP 0.2500000 1.0000000 1.0000000 0.2500000 0.4000000
## PUL 0.0000000 0.8636364 0.0000000 0.0000000 NaN
## LNSP 0.9166667 0.6923077 0.7333333 0.9166667 0.8148148
## Miscarriage 0.3333333 0.8421053 0.4000000 0.3333333 0.3636364
## tEP 0.5000000 1.0000000 1.0000000 0.5000000 0.6666667
## PUL 0.6666667 0.9545455 0.6666667 0.6666667 0.6666667
## LNSP 0.7500000 0.5384615 0.6000000 0.7500000 0.6666667
## Miscarriage 0.1666667 0.7894737 0.2000000 0.1666667 0.1818182
## tEP 0.2500000 0.9047619 0.3333333 0.2500000 0.2857143
## PUL 0.3333333 0.9545455 0.5000000 0.3333333 0.4000000
## LNSP 0.9166667 0.4615385 0.6111111 0.9166667 0.7333333
## Miscarriage 0.3333333 0.8947368 0.5000000 0.3333333 0.4000000
## tEP 0.0000000 0.9047619 0.0000000 0.0000000 NaN
## PUL 0.0000000 0.9545455 0.0000000 0.0000000 NaN
## LNSP 0.7500000 0.4615385 0.5625000 0.7500000 0.6428571
## Miscarriage 0.5000000 0.7894737 0.4285714 0.5000000 0.4615385
## tEP 0.0000000 0.9523810 0.0000000 0.0000000 NaN
## PUL 0.3333333 1.0000000 1.0000000 0.3333333 0.5000000
## LNSP 1.0000000 0.5384615 0.6666667 1.0000000 0.8000000
## Miscarriage 0.0000000 0.8947368 0.0000000 0.0000000 NaN
## tEP 0.0000000 0.9523810 0.0000000 0.0000000 NaN
## PUL 0.6666667 0.9090909 0.5000000 0.6666667 0.5714286
## LNSP 0.9166667 0.3076923 0.5500000 0.9166667 0.6875000
## Miscarriage 0.0000000 0.7894737 0.0000000 0.0000000 NaN
## tEP 0.2500000 1.0000000 1.0000000 0.2500000 0.4000000
## PUL 0.0000000 1.0000000 NA 0.0000000 NA
## LNSP 0.9166667 0.5384615 0.6470588 0.9166667 0.7586207
## Miscarriage 0.1666667 0.7894737 0.2000000 0.1666667 0.1818182
## tEP 0.2500000 0.9523810 0.5000000 0.2500000 0.3333333
## PUL 0.0000000 0.9545455 0.0000000 0.0000000 NaN
## LNSP 0.8333333 0.3076923 0.5263158 0.8333333 0.6451613
## Miscarriage 0.1666667 0.8947368 0.3333333 0.1666667 0.2222222
## tEP 0.2500000 1.0000000 1.0000000 0.2500000 0.4000000
## PUL 0.3333333 0.9545455 0.5000000 0.3333333 0.4000000
## LNSP 1.0000000 0.6153846 0.7058824 1.0000000 0.8275862
## Miscarriage 0.3333333 0.8947368 0.5000000 0.3333333 0.4000000
## tEP 0.0000000 0.9523810 0.0000000 0.0000000 NaN
## PUL 0.6666667 0.9545455 0.6666667 0.6666667 0.6666667
## Balanced.Accuracy
## 0.7275641
## 0.6140351
## 0.6250000
## 0.4318182
## 0.8044872
## 0.5877193
## 0.7500000
## 0.8106061
## 0.6442308
## 0.4780702
## 0.5773810
## 0.6439394
## 0.6891026
## 0.6140351
## 0.4523810
## 0.4772727
## 0.6057692
## 0.6447368
## 0.4761905
## 0.6666667
## 0.7692308
## 0.4473684
## 0.4761905
## 0.7878788
## 0.6121795
## 0.3947368
## 0.6250000
## 0.5000000
## 0.7275641
## 0.4780702
## 0.6011905
## 0.4772727
## 0.5705128
## 0.5307018
## 0.6250000
## 0.6439394
## 0.8076923
## 0.6140351
## 0.4761905
## 0.8106061
print(exp_LassoRes_2groups_modelMetExtract[, c(1, 2, 3, 6, 7, 8, 12)])
## vars
## 1 phenylalanine_11 unknown_30 unknown_31 Unknown_54 glutamate_111 Glutamine_115 Arginine_122
## 2 phenylalanine_11 unknown_30 Unknown_54 unknown_88 glutamate_111 Glutamine_115 acetate_121 Alanine_128 unknown_130
## 3 phenylalanine_11 unknown_30 Unknown_54 unknown_88 glutamate_111 Glutamine_115 acetate_121 Arginine_122 Alanine_128 unknown_130
## 4 phenylalanine_11 unknown_30 Unknown_54 glutamate_111 Glutamine_115 acetate_121 Arginine_122 Alanine_128 unknown_130
## 5 phenylalanine_11 unknown_30 Unknown_54 unknown_88 Glutamine_115 unknown_120 acetate_121 Alanine_128 unknown_130
## 6 phenylalanine_11 unknown_30 Unknown_54 unknown_88 Glutamine_115 unknown_120 Arginine_122 Alanine_128 unknown_130
## 7 phenylalanine_11 unknown_30 unknown_31 Unknown_54 Glutamine_115 Alanine_128 unknown_130
## 8 phenylalanine_11 unknown_30 Unknown_54 unknown_88 unknown_102 glutamate_111 Glutamine_115 acetate_121 Arginine_122 Alanine_128 unknown_130
## 9 phenylalanine_11 unknown_30 Unknown_54 glutamate_111 Glutamine_115 acetate_121 Arginine_122 Alanine_128
## 10 phenylalanine_11 unknown_30 unknown_31 Unknown_54 unknown_88 glutamate_111 Glutamine_115 acetate_121 Arginine_122 Alanine_128 unknown_130
## Sensitivity Specificity Precision Recall F1 Balanced.Accuracy
## 1 0.6666667 0.7142857 0.6666667 0.6666667 0.6666667 0.6904762
## 2 0.4166667 0.7142857 0.5555556 0.4166667 0.4761905 0.5654762
## 3 0.6666667 0.7142857 0.6666667 0.6666667 0.6666667 0.6904762
## 4 0.6666667 0.7142857 0.6666667 0.6666667 0.6666667 0.6904762
## 5 0.5833333 0.6428571 0.5833333 0.5833333 0.5833333 0.6130952
## 6 0.7500000 0.6428571 0.6428571 0.7500000 0.6923077 0.6964286
## 7 0.6666667 0.7142857 0.6666667 0.6666667 0.6666667 0.6904762
## 8 0.7500000 0.7857143 0.7500000 0.7500000 0.7500000 0.7678571
## 9 0.8333333 0.8571429 0.8333333 0.8333333 0.8333333 0.8452381
## 10 0.9166667 0.8571429 0.8461538 0.9166667 0.8800000 0.8869048
Above we could see that the models generated are not excellent. Quite variable with best performances predicting with circa 70% balance accuracy. However these models are using solely the metabolite fingerprint to predict.
# Models on the train/test splits ---- selecting signals that have been
# selected in at least 7 data slices
final_varsSelected_multigroup <- as.character(exp_LassoRes_concRes_multiGroup[exp_LassoRes_concRes_multiGroup$Freq >
6, 1])
final_varsSelected_multigroup
## [1] "acetate_121" "Alanine_128" "Arginine_122" "Glutamine_115"
## [5] "phenylalanine_11" "proline_107" "Unknown_27" "unknown_30"
## [9] "unknown_31" "Unknown_54"
final_varsSelected_2groups <- as.character(exp_LassoRes_concRes_2groups[exp_LassoRes_concRes_2groups$Freq >
6, 1])
final_varsSelected_2groups
## [1] "acetate_121" "Alanine_128" "Arginine_122" "glutamate_111"
## [5] "Glutamine_115" "phenylalanine_11" "unknown_130" "unknown_30"
## [9] "Unknown_54"
Predict_TrainDataSplits_multiGroup <- applyModelsToDataSplits(inputListwSplits = dataSplits,
varsForModel = final_varsSelected_multigroup, ntree = 500)
write.csv(modelMetrExtract(Predict_TrainDataSplits_multiGroup), "Predict_TrainDataSplits_multigroup.csv")
Predict_TrainDataSplits_2groups <- applyModelsToDataSplits(inputListwSplits = dataSplits_2groups,
varsForModel = final_varsSelected_2groups, ntree = 500)
write.csv(modelMetrExtract(Predict_TrainDataSplits_2groups), "Predict_TrainDataSplits_2groups.csv")
# Build final model ----
set.seed(800)
finalmodel_multigroup <- RandomForestWithStats(train = dataSplit_forValidation[[1]]$train[,
colnames(dataSplit_forValidation[[1]]$train) %in% final_varsSelected_multigroup],
test = dataSplit_forValidation[[1]]$test[, colnames(dataSplit_forValidation[[1]]$test) %in%
final_varsSelected_multigroup], factor_train = dataSplit_forValidation[[1]]$factor_train,
factor_test = dataSplit_forValidation[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_multigroup$Predictions, dataSplit_forValidation[[1]]$factor_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Miscarriage tEP PUL
## LNSP 24 4 5 3
## Miscarriage 4 5 3 2
## tEP 1 3 1 3
## PUL 0 3 1 0
##
## Overall Statistics
##
## Accuracy : 0.4839
## 95% CI : (0.355, 0.6144)
## No Information Rate : 0.4677
## P-Value [Acc > NIR] : 0.4484
##
## Kappa : 0.1994
##
## Mcnemar's Test P-Value : 0.3334
##
## Statistics by Class:
##
## Class: LNSP Class: Miscarriage Class: tEP Class: PUL
## Sensitivity 0.8276 0.33333 0.10000 0.00000
## Specificity 0.6364 0.80851 0.86538 0.92593
## Pos Pred Value 0.6667 0.35714 0.12500 0.00000
## Neg Pred Value 0.8077 0.79167 0.83333 0.86207
## Prevalence 0.4677 0.24194 0.16129 0.12903
## Detection Rate 0.3871 0.08065 0.01613 0.00000
## Detection Prevalence 0.5806 0.22581 0.12903 0.06452
## Balanced Accuracy 0.7320 0.57092 0.48269 0.46296
# as expected the multigroup is not a good model
# multigroup including also HCG and Progesterone
finalmodel_multigroupPlusHCGPROG <- RandomForestWithStats(train = dataSplit_forValidation[[1]]$train[,
colnames(dataSplit_forValidation[[1]]$train) %in% c(final_varsSelected_multigroup,
"HCG", "Progesterone")], test = dataSplit_forValidation[[1]]$test[, colnames(dataSplit_forValidation[[1]]$test) %in%
c(final_varsSelected_multigroup, "HCG", "Progesterone")], factor_train = dataSplit_forValidation[[1]]$factor_train,
factor_test = dataSplit_forValidation[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_multigroupPlusHCGPROG$Predictions, dataSplit_forValidation[[1]]$factor_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Miscarriage tEP PUL
## LNSP 27 4 6 3
## Miscarriage 2 6 3 3
## tEP 0 1 1 1
## PUL 0 4 0 1
##
## Overall Statistics
##
## Accuracy : 0.5645
## 95% CI : (0.4326, 0.6901)
## No Information Rate : 0.4677
## P-Value [Acc > NIR] : 0.08089
##
## Kappa : 0.3037
##
## Mcnemar's Test P-Value : 0.06636
##
## Statistics by Class:
##
## Class: LNSP Class: Miscarriage Class: tEP Class: PUL
## Sensitivity 0.9310 0.40000 0.10000 0.12500
## Specificity 0.6061 0.82979 0.96154 0.92593
## Pos Pred Value 0.6750 0.42857 0.33333 0.20000
## Neg Pred Value 0.9091 0.81250 0.84746 0.87719
## Prevalence 0.4677 0.24194 0.16129 0.12903
## Detection Rate 0.4355 0.09677 0.01613 0.01613
## Detection Prevalence 0.6452 0.22581 0.04839 0.08065
## Balanced Accuracy 0.7685 0.61489 0.53077 0.52546
# multigroup including also HCG and Progesterone and metadata
finalmodel_multigroupPlusHCGPROGandMetadata <- RandomForestWithStats(train = cbind(dataSplit_forValidation[[1]]$train[,
colnames(dataSplit_forValidation[[1]]$train) %in% c(final_varsSelected_multigroup,
"HCG", "Progesterone")], data_for_limma[dataSplit_forValidation[[1]]$idx,
c("BMI", "Age", "Gestation")]), test = cbind(dataSplit_forValidation[[1]]$test[,
colnames(dataSplit_forValidation[[1]]$test) %in% c(final_varsSelected_multigroup,
"HCG", "Progesterone")], data_for_limma[-dataSplit_forValidation[[1]]$idx,
c("BMI", "Age", "Gestation")]), factor_train = dataSplit_forValidation[[1]]$factor_train,
factor_test = dataSplit_forValidation[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_multigroupPlusHCGPROGandMetadata$Predictions, dataSplit_forValidation[[1]]$factor_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Miscarriage tEP PUL
## LNSP 28 3 6 2
## Miscarriage 0 6 4 3
## tEP 1 2 0 1
## PUL 0 4 0 2
##
## Overall Statistics
##
## Accuracy : 0.5806
## 95% CI : (0.4485, 0.7049)
## No Information Rate : 0.4677
## P-Value [Acc > NIR] : 0.04911
##
## Kappa : 0.3366
##
## Mcnemar's Test P-Value : 0.10950
##
## Statistics by Class:
##
## Class: LNSP Class: Miscarriage Class: tEP Class: PUL
## Sensitivity 0.9655 0.40000 0.00000 0.25000
## Specificity 0.6667 0.85106 0.92308 0.92593
## Pos Pred Value 0.7179 0.46154 0.00000 0.33333
## Neg Pred Value 0.9565 0.81633 0.82759 0.89286
## Prevalence 0.4677 0.24194 0.16129 0.12903
## Detection Rate 0.4516 0.09677 0.00000 0.03226
## Detection Prevalence 0.6290 0.20968 0.06452 0.09677
## Balanced Accuracy 0.8161 0.62553 0.46154 0.58796
# multigroup no metabolites, only HCG and Progesterone and metadata
finalmodel_multigroupONLY_HCGPROGandMetadata <- RandomForestWithStats(train = cbind(dataSplit_forValidation[[1]]$train[,
colnames(dataSplit_forValidation[[1]]$train) %in% c("HCG", "Progesterone")],
data_for_limma[dataSplit_forValidation[[1]]$idx, c("BMI", "Age", "Gestation")]),
test = cbind(dataSplit_forValidation[[1]]$test[, colnames(dataSplit_forValidation[[1]]$test) %in%
c("HCG", "Progesterone")], data_for_limma[-dataSplit_forValidation[[1]]$idx,
c("BMI", "Age", "Gestation")]), factor_train = dataSplit_forValidation[[1]]$factor_train,
factor_test = dataSplit_forValidation[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_multigroupONLY_HCGPROGandMetadata$Predictions, dataSplit_forValidation[[1]]$factor_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Miscarriage tEP PUL
## LNSP 28 5 4 3
## Miscarriage 1 4 4 3
## tEP 0 1 0 0
## PUL 0 5 2 2
##
## Overall Statistics
##
## Accuracy : 0.5484
## 95% CI : (0.4168, 0.6752)
## No Information Rate : 0.4677
## P-Value [Acc > NIR] : 0.12610
##
## Kappa : 0.2832
##
## Mcnemar's Test P-Value : 0.03001
##
## Statistics by Class:
##
## Class: LNSP Class: Miscarriage Class: tEP Class: PUL
## Sensitivity 0.9655 0.26667 0.00000 0.25000
## Specificity 0.6364 0.82979 0.98077 0.87037
## Pos Pred Value 0.7000 0.33333 0.00000 0.22222
## Neg Pred Value 0.9545 0.78000 0.83607 0.88679
## Prevalence 0.4677 0.24194 0.16129 0.12903
## Detection Rate 0.4516 0.06452 0.00000 0.03226
## Detection Prevalence 0.6452 0.19355 0.01613 0.14516
## Balanced Accuracy 0.8009 0.54823 0.49038 0.56019
# two groups (LNSP vs Other) first model using only selected metabolites
set.seed(800)
finalmodel_2groups <- RandomForestWithStats(train = dataSplit_forValidation_2group[[1]]$train[,
colnames(dataSplit_forValidation_2group[[1]]$train) %in% final_varsSelected_2groups],
test = dataSplit_forValidation_2group[[1]]$test[, colnames(dataSplit_forValidation_2group[[1]]$test) %in%
final_varsSelected_2groups], factor_train = dataSplit_forValidation_2group[[1]]$factor_train,
factor_test = dataSplit_forValidation_2group[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_2groups$Predictions, dataSplit_forValidation_2group[[1]]$factor_test,
positive = "Other")
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Other
## LNSP 20 11
## Other 9 23
##
## Accuracy : 0.6825
## 95% CI : (0.5531, 0.7942)
## No Information Rate : 0.5397
## P-Value [Acc > NIR] : 0.01493
##
## Kappa : 0.3643
##
## Mcnemar's Test P-Value : 0.82306
##
## Sensitivity : 0.6765
## Specificity : 0.6897
## Pos Pred Value : 0.7188
## Neg Pred Value : 0.6452
## Prevalence : 0.5397
## Detection Rate : 0.3651
## Detection Prevalence : 0.5079
## Balanced Accuracy : 0.6831
##
## 'Positive' Class : Other
##
# for both groups including HCG and progesterone
set.seed(800)
finalmodel_2groupsPlusHCGPROG <- RandomForestWithStats(train = dataSplit_forValidation_2group[[1]]$train[,
colnames(dataSplit_forValidation_2group[[1]]$train) %in% c(final_varsSelected_2groups,
"HCG", "Progesterone")], test = dataSplit_forValidation_2group[[1]]$test[,
colnames(dataSplit_forValidation_2group[[1]]$test) %in% c(final_varsSelected_2groups,
"HCG", "Progesterone")], factor_train = dataSplit_forValidation_2group[[1]]$factor_train,
factor_test = dataSplit_forValidation_2group[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_2groupsPlusHCGPROG$Predictions, dataSplit_forValidation_2group[[1]]$factor_test,
positive = "Other")
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Other
## LNSP 24 8
## Other 5 26
##
## Accuracy : 0.7937
## 95% CI : (0.673, 0.8853)
## No Information Rate : 0.5397
## P-Value [Acc > NIR] : 2.478e-05
##
## Kappa : 0.5878
##
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.7647
## Specificity : 0.8276
## Pos Pred Value : 0.8387
## Neg Pred Value : 0.7500
## Prevalence : 0.5397
## Detection Rate : 0.4127
## Detection Prevalence : 0.4921
## Balanced Accuracy : 0.7961
##
## 'Positive' Class : Other
##
# Model for both groups with selected metabolites, progesterone, HCG, BMI, Age
# and Gestation
set.seed(800)
finalmodel_2groupsPlusHCGPROG_andMetadata <- RandomForestWithStats(train = cbind(dataSplit_forValidation_2group[[1]]$train[,
colnames(dataSplit_forValidation_2group[[1]]$train) %in% c(final_varsSelected_2groups,
"HCG", "Progesterone")], data_for_limma[dataSplit_forValidation_2group[[1]]$idx,
c("BMI", "Age", "Gestation")]), test = cbind(dataSplit_forValidation_2group[[1]]$test[,
colnames(dataSplit_forValidation_2group[[1]]$test) %in% c(final_varsSelected_2groups,
"HCG", "Progesterone")], data_for_limma[-dataSplit_forValidation_2group[[1]]$idx,
c("BMI", "Age", "Gestation")]), factor_train = dataSplit_forValidation_2group[[1]]$factor_train,
factor_test = dataSplit_forValidation_2group[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_2groupsPlusHCGPROG_andMetadata$Predictions, dataSplit_forValidation_2group[[1]]$factor_test,
positive = "Other")
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Other
## LNSP 25 9
## Other 4 25
##
## Accuracy : 0.7937
## 95% CI : (0.673, 0.8853)
## No Information Rate : 0.5397
## P-Value [Acc > NIR] : 2.478e-05
##
## Kappa : 0.5899
##
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.7353
## Specificity : 0.8621
## Pos Pred Value : 0.8621
## Neg Pred Value : 0.7353
## Prevalence : 0.5397
## Detection Rate : 0.3968
## Detection Prevalence : 0.4603
## Balanced Accuracy : 0.7987
##
## 'Positive' Class : Other
##
# Model for both groups with progesterone, HCG, BMI, Age and Gestation (no
# metabolites)
set.seed(800)
finalmodel_2groups_HCGProgMetadata <- RandomForestWithStats(train = cbind(dataSplit_forValidation_2group[[1]]$train[,
colnames(dataSplit_forValidation_2group[[1]]$train) %in% c("HCG", "Progesterone")],
data_for_limma[dataSplit_forValidation_2group[[1]]$idx, c("BMI", "Age", "Gestation")]),
test = cbind(dataSplit_forValidation_2group[[1]]$test[, colnames(dataSplit_forValidation_2group[[1]]$test) %in%
c("HCG", "Progesterone")], data_for_limma[-dataSplit_forValidation_2group[[1]]$idx,
c("BMI", "Age", "Gestation")]), factor_train = dataSplit_forValidation_2group[[1]]$factor_train,
factor_test = dataSplit_forValidation_2group[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_2groups_HCGProgMetadata$Predictions, dataSplit_forValidation_2group[[1]]$factor_test,
positive = "Other")
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Other
## LNSP 22 10
## Other 7 24
##
## Accuracy : 0.7302
## 95% CI : (0.6035, 0.8343)
## No Information Rate : 0.5397
## P-Value [Acc > NIR] : 0.001515
##
## Kappa : 0.461
##
## Mcnemar's Test P-Value : 0.627626
##
## Sensitivity : 0.7059
## Specificity : 0.7586
## Pos Pred Value : 0.7742
## Neg Pred Value : 0.6875
## Prevalence : 0.5397
## Detection Rate : 0.3810
## Detection Prevalence : 0.4921
## Balanced Accuracy : 0.7323
##
## 'Positive' Class : Other
##
# Model for both groups with progesterone and HCG only
set.seed(800)
finalmodel_2groups_HCGProgMetadata <- RandomForestWithStats(train = dataSplit_forValidation_2group[[1]]$train[,
colnames(dataSplit_forValidation_2group[[1]]$train) %in% c("HCG", "Progesterone")],
test = dataSplit_forValidation_2group[[1]]$test[, colnames(dataSplit_forValidation_2group[[1]]$test) %in%
c("HCG", "Progesterone")], factor_train = dataSplit_forValidation_2group[[1]]$factor_train,
factor_test = dataSplit_forValidation_2group[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_2groups_HCGProgMetadata$Predictions, dataSplit_forValidation_2group[[1]]$factor_test,
positive = "Other")
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Other
## LNSP 21 10
## Other 8 24
##
## Accuracy : 0.7143
## 95% CI : (0.5865, 0.8211)
## No Information Rate : 0.5397
## P-Value [Acc > NIR] : 0.003484
##
## Kappa : 0.4279
##
## Mcnemar's Test P-Value : 0.813664
##
## Sensitivity : 0.7059
## Specificity : 0.7241
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.6774
## Prevalence : 0.5397
## Detection Rate : 0.3810
## Detection Prevalence : 0.5079
## Balanced Accuracy : 0.7150
##
## 'Positive' Class : Other
##
# Model for both groups with BMI, Age and Gestation only - results are very
# poor, with routine metadata not able to predict
set.seed(800)
finalmodel_2groups_HCGProgMetadata <- RandomForestWithStats(train = data_for_limma[dataSplit_forValidation_2group[[1]]$idx,
c("BMI", "Age", "Gestation")], test = data_for_limma[-dataSplit_forValidation_2group[[1]]$idx,
c("BMI", "Age", "Gestation")], factor_train = dataSplit_forValidation_2group[[1]]$factor_train,
factor_test = dataSplit_forValidation_2group[[1]]$factor_test, ntree = 500)
confusionMatrix(finalmodel_2groups_HCGProgMetadata$Predictions, dataSplit_forValidation_2group[[1]]$factor_test,
positive = "Other")
## Confusion Matrix and Statistics
##
## Reference
## Prediction LNSP Other
## LNSP 8 14
## Other 21 20
##
## Accuracy : 0.4444
## 95% CI : (0.3192, 0.5751)
## No Information Rate : 0.5397
## P-Value [Acc > NIR] : 0.9496
##
## Kappa : -0.1384
##
## Mcnemar's Test P-Value : 0.3105
##
## Sensitivity : 0.5882
## Specificity : 0.2759
## Pos Pred Value : 0.4878
## Neg Pred Value : 0.3636
## Prevalence : 0.5397
## Detection Rate : 0.3175
## Detection Prevalence : 0.6508
## Balanced Accuracy : 0.4320
##
## 'Positive' Class : Other
##
We further compare prediction outputs with three models using a generalised linear model. In the first model we use selected metabolites, HCG +Progesterone and clinical variables (BMI, Gestation, Age). In the second model we use selected metabolites, HCG and Progesterone. In the third model we use HCG, Progesterone, BMI, Gestation and Age. All models predict really well, with equivalent AUROCs.
labels_ts = dataSplit_forValidation_2group[[1]]$factor_test
labels_tr = ifelse(dataSplit_forValidation_2group[[1]]$factor_train=='LNSP', 0, 1)
labels_ts = ifelse(dataSplit_forValidation_2group[[1]]$factor_test=='LNSP', 0, 1)
LR_everything = fit_LR(data_tr=cbind(dataSplit_forValidation_2group[[1]]$train[,colnames(dataSplit_forValidation_2group[[1]]$train)%in%c(final_varsSelected_2groups,"HCG","Progesterone")],data_for_limma[dataSplit_forValidation_2group[[1]]$idx,c("BMI","Age","Gestation")]),
data_ts =cbind(dataSplit_forValidation_2group[[1]]$test[,colnames(dataSplit_forValidation_2group[[1]]$test)%in%c(final_varsSelected_2groups,"HCG","Progesterone")],data_for_limma[-dataSplit_forValidation_2group[[1]]$idx,c("BMI","Age","Gestation")]),
labels_tr = labels_tr)
LR_metabs_HCGProg = fit_LR(data_tr= as.data.frame(dataSplit_forValidation_2group[[1]]$train[,colnames(dataSplit_forValidation_2group[[1]]$train)%in%c(final_varsSelected_2groups,"HCG","Progesterone")]),
data_ts =as.data.frame(dataSplit_forValidation_2group[[1]]$test[,colnames(dataSplit_forValidation_2group[[1]]$test)%in%c(final_varsSelected_2groups,"HCG","Progesterone")]),
labels_tr = labels_tr)
LR_noMetabs = fit_LR(data_tr=cbind( dataSplit_forValidation_2group[[1]]$train[,colnames(dataSplit_forValidation_2group[[1]]$train)%in%c("HCG","Progesterone")],data_for_limma[dataSplit_forValidation_2group[[1]]$idx,c("BMI","Age","Gestation")]),
data_ts =cbind(dataSplit_forValidation_2group[[1]]$test[,colnames(dataSplit_forValidation_2group[[1]]$test)%in%c("HCG","Progesterone")],data_for_limma[-dataSplit_forValidation_2group[[1]]$idx,c("BMI","Age","Gestation")]),
labels_tr = labels_tr)
LR_HCGProgOnly = fit_LR(data_tr= as.data.frame(dataSplit_forValidation_2group[[1]]$train[,colnames(dataSplit_forValidation_2group[[1]]$train)%in%c("HCG","Progesterone")]),
data_ts =as.data.frame(dataSplit_forValidation_2group[[1]]$test[,colnames(dataSplit_forValidation_2group[[1]]$test)%in%c("HCG","Progesterone")]),
labels_tr = labels_tr)
roc.plot(labels_ts, cbind(LR_everything,LR_metabs_HCGProg,LR_noMetabs,LR_HCGProgOnly),
legend = TRUE,
leg.text = c('All',"Metabs_HCG_Prog", 'NoMetabs',"OnlyHCG&Prog"),
plot.thres = NULL,cex=5,xlab="1-Specificity",ylab="Sensitivity")
Metabolite signals are not enough to predict the four different types of outcome. When the model includes also HCG and Progesterone predictions improve and become slightly better than random, showing some promise.
Alternatively, a model to classify LNSP from Other adverse outcome performs very well. Using only the metabolite signals we are able to build a significant model that improves quite dramatically if HCG and Progesterone are included. In fact only HCG and Progesterone are good predictors on their own. If we also include BMI, Age and Gestation time then model performance improves slightly although these are not good predictors on their own.
Metabolites improve performance just so slightly. This could be due to the caveat is that the signals observed in the NMR data have a very small effect size.